Abstract

This paper reports the results of using a three-layer backpropagation artificial neural 
network to predict item difficulty in a reading comprehension test. Two network structures 
were developed: one with the sigmoid function in the output processing unit and the other 
without the sigmoid function in the output processing unit. The data set, which consisted of
a table of coded test items and corresponding item difficulties, was partitioned into a
training set and a test set in order to train and test the neural networks. To demonstrate
the consistency of the neural networks in predicting item difficulty, the training and
testing runs were repeated four times, each starting with a new set of initial weights.
Additionally, the training and testing runs were repeated by switching the training set and 
the test set. The mean squared error values between the actual and predicted item 
difficulty demonstrated the consistency of the neural networks in predicting item difficulty 
for the multiple training and testing runs. Significant correlations were obtained between 
the actual and predicted item difficulties and the Kruskal-Wallis test indicated no 
significant difference in the ranks of actual and predicted values. 



I Introduction

This paper focuses on developing an artificial neural network (ANN) approach to predict item
difficulty in a standardized reading comprehension test. The study was
motivated by many considerations, which can be grouped into two broad areas:
(1) reading research and (2) the ability of ANNs to outperform traditional statistical
techniques such as multiple regression in prediction studies.

Reading research 

Identifying the variables which uniquely account for significant variance in the percent
correct obtained by examinees for each item in a standardized, group-administered reading
comprehension test is a major focus in reading research for the following reasons. There
are many potential sources of difficulty in a reading comprehension test which may derive
from the way in which the prose passages and the reading comprehension questions are
constructed (Scheuneman, Gerritz, and Embretson, 1989). Test item writers do not usually
control or quantify the sources of difficulty, and reading researchers are unsure of what
factors account for the observed item difficulty in a multiple-choice reading comprehension 
test (Embretson and Wetzel, 1987). Researchers have noted that content and test development 
experts cannot reliably estimate the difficulty of a test item (Bejar, 1983). 

ANN versus traditional statistical techniques 

A growing literature suggests that ANNs outperform traditional statistical
procedures such as multiple regression in prediction studies. Studies in which traditional
statistical methods and ANNs have been compared show an advantage for ANNs in
time series forecasting (Sharda and Patil, 1992), processing control (Nelson and Illingworth,
1991), signal processing (Lapedes and Farber, 1987), and predicting an AIDS risk index
(Lykins and Chance, 1992). One reason offered for the better
performance of ANNs over multiple regression is that backpropagation networks (one type of
ANN) are a form of nonlinear regression and are not bound to the functional fitting inherent
in multiple regression, which utilizes the least-mean-squared error to determine the best
representative function in a data set (Lykins and Chance, 1992).

The validity of the studies in which multiple regression is used to predict item
difficulty is not high. Perkins and Brutten (1991) correlated 24 variables with the item
difficulty indices from a standardized reading comprehension test and obtained correlation
coefficients ranging from 0.603 to -0.011. Only five variables correlated significantly at
the 0.05 level with item difficulty, and these five variables were retained for a stepwise
multiple regression analysis in which item difficulty was the dependent variable. Only four
of the five variables uniquely accounted for significant variance, and the entire model
accounted for 72.49 percent of variance in the test. When only four of 24 variables account
for significant variance in item difficulty, the validity of a multiple regression study is
not high. It is hypothesized that using variables in combination and introducing forms of
non-linearity might improve the validity of item difficulty studies. Combining variables 
and introducing non-linearity can be accomplished in an ANN by manipulating the input 
variables and by employing the non-linear transfer functions of the neurons in an ANN. 
Thus, the purpose of the study reported in this paper was to train an ANN to predict item 
difficulty in a reading comprehension test and to compare the actual item difficulties with 
the predicted item difficulties in order to determine whether the two sets of values were 
statistically similar or different. 

II Artificial neural networks

Interest in artificial neural networks as an alternative to conventional algorithmic 
techniques has grown rapidly in recent years. Artificial neural networks attempt to emulate 
sophisticated brain-like functions such as learning and generalization. Researchers from 
diverse fields such as engineering, science, statistics, and mathematics are actively 
involved in developing and applying artificial neural net models to solve problems in 
pattern recognition, signal processing, biological system modeling, data analysis, and 
optimization. 

An artificial neural network is a large parallel information processing network 
composed of many simple non-linear processing elements. Information is stored in a 
distributed fashion throughout the interconnections of the network. Artificial neural nets
are specified by the network topology, node characteristics, and the training or learning
rules. A variety of artificial neural net models have been developed, including
backpropagation nets (Werbos, 1974; Parker, 1982; Rumelhart & McClelland, 1986), associative
memory nets (Hopfield & Tank, 1985), adaptive resonance nets (Grossberg, 1987),
self-organizing nets (Kohonen, 1988), and counterpropagation nets (Hecht-Nielsen, 1987). The
networks differ in that they operate on binary/continuous valued inputs, use
unsupervised/supervised training, and perform classification, clustering or optimization
tasks.

The backpropagation neural network 

The neural network model selected for the item difficulty prediction problem addressed in 
this paper is the backpropagation network. A backpropagation network implements a 
modifiable function which maps a set of inputs to a set of outputs. The functional form is 
modified by adjusting the adaptable interconnection weights by means of the backpropagation 
training algorithm. Since the item difficulty prediction problem essentially involves 
determining the functional mapping between the 24 variables (input) and the item difficulty 
(output), a backpropagation neural network can be designed to solve the mapping of the 24 
variables to the corresponding item difficulties (p-values). 

Network architecture 

The backpropagation network is a hierarchical feed-forward network system consisting of two 
or more fully interconnected layers of processing units (artificial neurons). An N-H-M
backpropagation network refers to a three-layer network with N, H, and M processing units in
the first, second, and third layers respectively. An N-H-1 backpropagation network is shown
in Figure 1. The first, second, and third layers illustrated in this figure are the input
layer, the hidden layer, and the output layer respectively. Each processing unit is
represented by a circle and each interconnection between processing units by an arrow. Each
interconnection is weighted by an adaptive coefficient called the interconnection weight
(not marked in the figure). The input to the network is represented by the N-dimensional
vector $u = (u_1, u_2, \ldots, u_N)$ and the output by $y$. For convenience, the outputs of the processing
units in the input and hidden layers are labelled in the circles representing the processing 
units. By using a training algorithm to adapt the interconnection weights, the 
backpropagation network has the ability to implement a wide range of responses to the 
patterns in a given training set. 




Fig. 1 A three-layer backpropagation neural network




Processing unit 

Figure 2 shows a typical processing unit, which consists of a summing unit and a non-linear
sigmoidal activation function $f[\cdot]$. The output $y$ of the processing unit is given by

$$y = f[S],$$

where

$$S = \sum_{i} w_i u_i$$

and

$$f[S] = \frac{1}{1 + e^{-S}}.$$

That is, the processing unit first computes a weighted sum $S$ of its inputs and then computes
a function $f[S]$ of the weighted sum to give the activation level $y$ (output) of the
processing unit. Due to the sigmoidal function, the output $y$ is limited to values between 0
and 1.
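The processing unit described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the example weights and inputs are made up for the sketch and do not come from the study.

```python
import math

def sigmoid(s):
    """Sigmoidal activation f[S] = 1 / (1 + e^(-S))."""
    return 1.0 / (1.0 + math.exp(-s))

def processing_unit(weights, inputs):
    """Compute the weighted sum S of the inputs, then apply the sigmoid."""
    s = sum(w * u for w, u in zip(weights, inputs))
    return sigmoid(s)

# Illustrative (made-up) weights and inputs; here S = 0.5 - 0.5 + 0.3 = 0.3
y = processing_unit([0.5, -0.25, 0.1], [1.0, 2.0, 3.0])
print(y)
```

Whatever the weights, the returned activation lies strictly between 0 and 1, which is the compression effect the sigmoid introduces.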

Network design issues 

The design of a backpropagation network for a particular problem involves determining: 

(1) The number of layers. 

(2) The number of processing units in each layer. 

(3) The format of the input to the network. 

The backpropagation approximation theorem (Hecht-Nielsen, 1991) proves that three layers are
generally enough to approximate the functional mapping required for most practical problems
and, therefore, a backpropagation network with three layers is a good choice when no prior
knowledge of the mapping is assumed. The number of processing units in the input layer is
generally governed by the input dimension. No theory or rules exist to select the number of
processing units in the hidden layer; therefore, the number of hidden layer units is
determined empirically. The number of processing units in the output layer is problem-dependent;
for example, pattern classification problems require the output layer to have one processing
unit per pattern class (Gupta, Sayeh, & Tammana, 1990) or one processing unit per pattern
feature (Gupta & Upadhye, 1991; Gupta et al., 1993). The prediction problem addressed in
this paper requires one processing unit in the output layer. The input to the 
backpropagation network is a vector of real numbers and due to the fixed structure of the 
network, the input vector must have a fixed dimension. 

Network training 

The multilayer perceptron is trained under supervision using the backpropagation algorithm
(Rumelhart & McClelland, 1986). The network is presented with pairs of vectors: the input
vector to the network and the desired network output vector for the input pattern vector.
The network functions in two stages during training: a forward pass and a backward pass. In 
the forward pass, the input vector is presented to the network and the outputs of the units 
are propagated through each upper layer until the network output is generated. The 
difference (error) between the network output and the desired output is computed for each 
output unit and during the backward pass, a function of the error is fed back through the 
network layers to adjust the interconnection weights in order to minimize the error. The 
forward and backward passes are repeated until the network converges, that is, until a 
measure of the error is acceptably small. During training, the network gradually learns to 
produce the desired outputs. The backpropagation training algorithm is an iterative 
gradient algorithm designed to minimize the mean square error between the desired network 
output and the actual network output. The backpropagation algorithm applied to train the 
neural network shown in Figure 1 is summarized below. 

Network Dimensions: 

Let the dimension of the input training vector and the network input layer be N and let H be 
the dimension of the hidden layer. 

Network Initialization: 

Set all the interconnection weights to small random values with zero mean. Typically, the 
weights are initialized to take random values between -0.5 and 0.5. 

Apply Input and Set Desired Net Outputs: 

Assume that the network is designed to predict $M$ values and let $u_{i,m}$, $i = 1, 2, \ldots, N$, represent
an N-dimensional training vector $u_m$ for the $m$th value. Let $d_m$ be the desired network output for
the input $u_m$. The presentation of the input may be done in several ways. One approach
is to apply the input training vector $u_m$, set the desired network output $d_m$, and not
change the training vector during the training iterations (an iteration is a forward pass
and a backward pass) until the network converges to the desired output. The process is
repeated for the remaining training vectors. Alternatively, the training vectors can be
rotated cyclically from iteration to iteration. The desired network output must, therefore,
also be set from iteration to iteration.

Compute Actual Unit Outputs (forward pass):

The outputs are computed sequentially from the input layer to the
output layer. The output of the $j$th unit in the input layer is given by:

$$u^1_{j,m} = f\Big[\sum_{i=1}^{N} w_{ij}\, u_{i,m}\Big], \quad 1 \le j \le N.$$

The outputs of the input layer are the inputs to the hidden layer and the output of the $k$th
unit in the hidden layer is given by:

$$u^2_{k,m} = f\Big[\sum_{j=1}^{N} w^1_{jk}\, u^1_{j,m}\Big], \quad 1 \le k \le H.$$

The outputs of the hidden layer are the inputs to the output layer and the network output in
response to the input $u_m$ is given by:

$$y_m = f\Big[\sum_{k=1}^{H} w^2_k\, u^2_{k,m}\Big].$$

In the above equations, $f[\cdot]$ is the sigmoidal function, that is,

$$f[x] = \frac{1}{1 + e^{-x}},$$

and $w_{ij}$, $w^1_{jk}$ and $w^2_k$ are the connection weights between the network input and the input
layer, the input and hidden layers, and the hidden layer and the output layer, respectively.
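The forward pass can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names (`forward`, `w_in`, `w_hid`, `w_out`) are hypothetical, every unit is assumed to apply the sigmoid, no bias terms are used, and the weights are drawn uniformly between -0.5 and 0.5 as described in the initialization step.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(u, w_in, w_hid, w_out):
    """Forward pass through the input, hidden and output layers.

    w_in[i][j]  : network input i    -> input-layer unit j
    w_hid[j][k] : input-layer unit j -> hidden unit k
    w_out[k]    : hidden unit k      -> single output unit
    """
    n_in = len(w_in[0])
    n_hid = len(w_hid[0])
    u1 = [sigmoid(sum(w_in[i][j] * u[i] for i in range(len(u))))
          for j in range(n_in)]
    u2 = [sigmoid(sum(w_hid[j][k] * u1[j] for j in range(n_in)))
          for k in range(n_hid)]
    y = sigmoid(sum(w_out[k] * u2[k] for k in range(n_hid)))
    return u1, u2, y

# Dimensions used later in the paper: 24 inputs, 17 hidden units, 1 output.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
N, H = 24, 17
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(N)]
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(N)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(H)]

u = [rng.random() for _ in range(N)]  # made-up input already scaled to [0, 1]
u1, u2, y = forward(u, w_in, w_hid, w_out)
print(len(u1), len(u2), y)
```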

Update Weights (backward pass):

Hidden layer - Output layer interconnection weights:

The interconnection weights are updated sequentially from the output layer to the input
layer. If $w^2_k(t)$ is the interconnection weight between the $k$th unit in the hidden layer and the
output layer unit at time $t$, then the weight at time $(t+1)$ is given by:

$$w^2_k(t+1) = w^2_k(t) + \eta\, \delta_m\, u^2_{k,m},$$

where $u^2_{k,m}$ is the output of the $k$th unit in the hidden layer and

$$\delta_m = y_m (1 - y_m)(d_m - y_m).$$

$\delta_m$ is the error for the output of the unit in the output layer when the input is $u_m$, and $\eta$
is a gain term which controls the learning (adaptation) rate of the network. The gain term
$\eta$ is typically assigned a value between 0 and 1; the actual value selected
is application-dependent. The gain term controls the convergence rate and the
stability of the network and, in practice, $\eta$ is adjusted for fast adaptation and for
obtaining stable estimates of the interconnection weights.

Input layer - Hidden layer interconnection weights:

The updated weights between the input and hidden layers are given by:

$$w^1_{jk}(t+1) = w^1_{jk}(t) + \eta\, \delta_k\, u^1_{j,m},$$

where $u^1_{j,m}$ is the output of the $j$th unit in the input layer and
$\delta_k$ is the error for the output of unit $k$ in the hidden layer.

Input - Input layer interconnection weights:

The updated weights between the input and input layer are given by:

$$w_{ij}(t+1) = w_{ij}(t) + \eta\, \delta_j\, u_{i,m},$$

where $\delta_j$ is the error for unit $j$ in the input layer.
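The defining expressions for the error terms are not fully legible in this copy. Assuming the standard backpropagation rule for sigmoid units with a single output unit, and writing $u^1_{j,m}$ and $u^2_{k,m}$ for the outputs of input-layer unit $j$ and hidden unit $k$, $w^1_{jk}$ for the input-to-hidden weights and $w^2_k$ for the hidden-to-output weights (notation assumed here), the three error terms would take the following form. This is a textbook sketch and may differ in detail from the expressions the authors printed.

```latex
\begin{align*}
\delta_m &= y_m\,(1 - y_m)\,(d_m - y_m), \\
\delta_k &= u^2_{k,m}\,(1 - u^2_{k,m})\,\delta_m\, w^2_k, \\
\delta_j &= u^1_{j,m}\,(1 - u^1_{j,m}) \sum_{k=1}^{H} \delta_k\, w^1_{jk}.
\end{align*}
```

Each factor of the form $x(1 - x)$ is the derivative of the sigmoid evaluated at that unit's output, so every weight update is a gradient step on the squared error $\tfrac{1}{2}(d_m - y_m)^2$.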

The training set generally consists of a set of representative prototype vectors. 
The training prototypes could be a vector representation of the raw input data or a feature 
vector computed from the raw data. Network convergence can be tested in several ways (Gupta 
et al., 1992). The most practical test for network convergence is the error limit test, 
i.e., test whether the absolute difference e between the desired response and the response of
each output unit is below a small specified error limit. Alternatively, training could be
terminated when the sum of the squares of the errors for all output units is below a
specified limit.
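The training procedure and the error limit test can be put together in a short sketch. It assumes the standard sigmoid backpropagation updates, cyclic presentation of the training vectors, and no bias terms; the names `train_step` and `train`, the toy samples, and the sweep cap are illustrative choices, not details taken from the study.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(u, w, w1, w2):
    n, h = len(u), len(w2)
    u1 = [sigmoid(sum(w[i][j] * u[i] for i in range(n))) for j in range(n)]
    u2 = [sigmoid(sum(w1[j][k] * u1[j] for j in range(n))) for k in range(h)]
    y = sigmoid(sum(w2[k] * u2[k] for k in range(h)))
    return u1, u2, y

def train_step(u, d, w, w1, w2, eta):
    """One iteration (forward pass plus backward pass); updates weights in place.

    Returns the absolute output error measured before the update.
    """
    n, h = len(u), len(w2)
    u1, u2, y = forward(u, w, w1, w2)
    # Error terms, all computed from the weights as they were before updating
    delta_m = y * (1.0 - y) * (d - y)
    delta_k = [u2[k] * (1.0 - u2[k]) * delta_m * w2[k] for k in range(h)]
    delta_j = [u1[j] * (1.0 - u1[j]) * sum(delta_k[k] * w1[j][k] for k in range(h))
               for j in range(n)]
    for k in range(h):
        w2[k] += eta * delta_m * u2[k]
    for j in range(n):
        for k in range(h):
            w1[j][k] += eta * delta_k[k] * u1[j]
    for i in range(n):
        for j in range(n):
            w[i][j] += eta * delta_j[j] * u[i]
    return abs(d - y)

def train(samples, h, eta=0.2, limit=0.05, max_sweeps=5000, seed=0):
    """Rotate through the samples until every error is below the limit."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(n)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(h)]
    worst = float("inf")
    for _ in range(max_sweeps):
        worst = max(train_step(u, d, w, w1, w2, eta) for u, d in samples)
        if worst < limit:  # error limit test
            break
    return w, w1, w2, worst

# Toy (made-up) training pairs: (input vector, desired output)
samples = [([0.1, 0.9], 0.8), ([0.9, 0.1], 0.2), ([0.5, 0.5], 0.5)]
w, w1, w2, worst = train(samples, h=3)
print(worst)
```

The sweep cap is a safeguard added for the sketch: the error limit test alone gives no guarantee of termination if the network cannot fit the training set.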

Network testing 

In the testing stage, the vector representing the input to be tested is presented to the 
input of the trained network and the network outputs are computed using one forward pass. 
In neural networks designed for classification problems, the input test pattern is assigned 
to the class of the network output that yields the maximum value (maximum response rule). 
For the prediction problem involving a single network output, the network is trained to
predict $M$ values. The value of the network output during testing is, therefore, the
predicted value.

III Purpose and design of the study

The specific data set to be analyzed in this paper comes from a study by Perkins and Brutten 
(1991). Three classes of variables were examined: (1) various counts of text surface 
structure (Drum, Calfee, and Cook, 1981), (2) propositional analysis of the passages and
item stems (Scheuneman and Gerritz, 1990; Scheuneman, Gerritz and Embretson, 1989) and (3) 
cognitive demand (Scheuneman, Gerritz and Embretson, 1989). 

Text structure 

The variables describing the structure of the texts included passage content (humanities, 
nonhumanities), the number of paragraphs per passage, the number of lines per passage, the
number of test items per passage, the number of words per passage, the number of content 
words per passage, the number of sentences per passage, a passage word/sentence ratio, and 
the percent of content words per passage. 

Propositional analysis

The following propositional counts were conducted for both the test passages and the item
stems separately: the number of arguments, the number of modifiers, the number of
predicates, argument density (the number of arguments divided by the number of sentences),
modifier density, predicate density, and combined density (the total number of propositions
divided by the number of sentences).



Cognitive demand 

Scheuneman, Gerritz and Embretson (1989) used five cognitive process categories in their
analysis, which were modified as follows for the study reported herein:

0 = identify, recognize, name, discern, locate, match, exemplify, or illustrate
    a concrete piece of relevant information in the text which was given in an
    item stem.

1 = non-identification

    - support/weaken a claim, procedure, or outcome; substantiate, demonstrate,
      prove, confirm, verify a result; negate, critique, contradict, or disprove
      a claim, procedure or outcome

    - infer, conclude, induce, deduce, diagnose, distinguish, differentiate,
      contrast

    - generalize, plausibly universalize, find common ground, transfer,
      analogize, apply, carry over

    - problem-solve, calculate, inquire, experiment, evaluate, appraise, weigh,
      compare

(adapted from Scheuneman, Gerritz and Embretson, 1989, pp. 14-14).

If a test item required the reader to conduct a cognitive operation on a concrete, verbatim
piece of information in the text, it was coded 0. All other cognitive requirements were
coded 1.

IV Method

Subjects 

Seventy students enrolled in intensive English classes participated as subjects for this 
study. The distribution of native languages was the following: Japanese, sixteen; Chinese, 
thirteen; Arabic, twelve; Korean, eleven; Spanish, nine; Thai, two; Turkish, two; Urdu, two; 
Hebrew, one; Indonesian, one; and Wolof, one. 

Instrumentation 

The elicitation instrument consisted of 29 reading comprehension items from the Test of 
English as a Foreign Language, Form 3LTFG (Educational Testing Service, 1990). 

Data coding

Each of the twenty-nine reading comprehension test items was coded according to three sets 
of variables: text structure, propositional analysis of passages and stems, and cognitive 
demand. The two researchers coded the items independently and anonymously. Disagreements
were adjudicated by a third party, and the consensus was recorded. Pearson correlations for
the continuous variables and percentage agreement for the categorical variables were used to
determine coding consistency. The coefficients ranged from 0.85 to 0.93. The
item difficulty of each item was calculated as the proportion of correct responses. 
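As a small worked example of the item difficulty index, the proportion of correct responses can be computed as below. The function name and the response counts are hypothetical and chosen only to illustrate the arithmetic with 70 examinees.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (the p-value)."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(1 for r in responses if r) / len(responses)

# Hypothetical item: 49 of 70 examinees answer correctly
p = item_difficulty([True] * 49 + [False] * 21)
print(p)  # 0.7
```

Under this definition a higher p-value means an easier item.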

Network training and testing 

The data available consisted of the 29 coded reading comprehension test items shown in Table
1. Two sets, a training set and a test set, were created from the available data set. In
order for the training set to be representative, items were selected by picking the first
item and every other item in the table to give a total of 15 items in the training set. The
remaining 14 items constituted the test set. A three-layer backpropagation network was
designed with 24 input units (one for each of the 24 input variables) and 1 output unit for
the predicted item difficulty (p-value). The number of units in the hidden layer was
empirically determined to be 17. The input data were normalized to take values between 0
and 1 by dividing each variable by its highest value in the table. Two variants of the
24-17-1 network were implemented: one with the sigmoid function in the output processing unit
and the other without the sigmoid function in the output processing unit. The rationale for
implementing the network without the sigmoid function in the output unit was to determine
the effect, if any, of the sigmoid function heavily compressing large and small input
values. All processing units in the input and hidden layers used the sigmoid function. The
gain term $\eta$ used in training was 0.2 and the error limit test with e = 0.05 was used to test
for network convergence.
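The data preparation described above, normalizing each variable by its largest value and taking the first and every other item for training, can be sketched as follows. The helper names and the tiny example table are illustrative assumptions; the sketch also assumes non-negative variables, so that dividing by the column maximum yields values between 0 and 1.

```python
def normalize_columns(table):
    """Divide every variable (column) by its largest value in the table."""
    maxima = [max(column) for column in zip(*table)]
    return [[value / m if m else 0.0 for value, m in zip(row, maxima)]
            for row in table]

def split_alternate(items):
    """First item and every other item -> training set; the rest -> test set."""
    return items[0::2], items[1::2]

# 29 items, identified here only by their position in the table
item_ids = list(range(1, 30))
train_ids, test_ids = split_alternate(item_ids)
print(len(train_ids), len(test_ids))  # 15 14

# Tiny made-up table with two variables per item
scaled = normalize_columns([[2.0, 10.0], [4.0, 5.0]])
print(scaled)  # [[0.5, 1.0], [1.0, 0.5]]
```

Switching the two sets, as done in the second group of runs, amounts to swapping the two returned lists.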

V Results 

The neural networks were trained using the backpropagation training algorithm to output the 
desired p-values in the training set. The 14 items in the test set were tested and the 
results (Run 1) are shown in Tables 2 and 3. In order to demonstrate the consistency of the 
neural networks in predicting item difficulty, the two networks were trained starting with a 

[Table 1, the coded test items, is illegible in the source and is omitted here.]

Table 2 Predicted p-values with the sigmoid function (original training set and test set)

Actual     Predicted p-value
p-value    Run 1          Run 2          Run 3          Run 4
0.90       0.907          [illegible]    [illegible]    0.859
0.90       0.925          0.933          0.942          0.901
0.66       0.713          0.725          0.739          0.753
0.91       0.830          0.866          0.844          0.867
0.67       0.601          0.685          0.651          0.619
0.49       0.709          0.719          0.718          0.754
0.71       0.708          0.692          0.694          0.671
0.53       0.242          0.161          0.284          0.208
0.53       [illegible]    [illegible]    [illegible]    [illegible]
0.19       0.407          0.398          0.396          0.399
0.33       0.433          0.425          0.413          0.387
0.76       0.585          0.577          0.633          0.675
0.30       0.272          0.261          0.276          0.279
0.34       0.295          0.291          0.313          0.331
0.27       0.195          0.183          0.191          0.169
MSE        0.0165         0.0197         0.0134         0.0172

χ² = 5.23, n.s., 0.01 level, df 4



Table 3 Predicted p-values without the sigmoid function (original training set and test set)

Actual     Predicted p-value
p-value    Run 1     Run 2     Run 3     Run 4
0.90       0.837     0.886     0.878     0.889
0.90       0.924     0.987     0.955     0.983
0.66       0.691     0.717     0.698     0.747
0.91       0.868     0.905     0.863     0.856
0.67       0.587     0.678     0.635     0.597
0.49       0.688     0.685     0.678     0.751
0.71       0.704     0.686     0.704     0.679
0.53       0.288     0.233     0.392     0.257
0.53       0.583     0.550     0.512     0.528
0.19       0.411     0.410     0.401     0.402
0.33       0.423     0.435     0.420     0.401
0.76       0.628     0.547     0.561     0.635
0.30       0.313     0.277     0.284     0.346
0.34       0.323     0.302     0.286     0.348
0.27       0.229     0.178     0.167     0.191
MSE        0.0128    0.0169    0.0112    0.0161

χ² = 5.362, n.s., 0.01 level, df 4

new set of initial weights taking random values between -0.5 and 0.5. This form of training
was repeated three times and the results of testing the 14 items are also shown in Tables 2
and 3 (Runs 2-4). Additionally, the training set and test set were switched, i.e., the new
training set consisted of the 14 items in the original test set and the new test set
consisted of the 15 items in the original training set. The training of the two networks
was repeated using exactly the same sets of initial weights used in the original 4 runs and
the test results are shown in Tables 4 and 5. The mean squared error (MSE) computed from
the actual and predicted p-values is also shown for each run in Tables 2-5. The MSE
provides a measure for evaluating the consistency in the performance of the networks for the
multiple training runs and also provides a measure for comparing the performances of the two
networks. The small MSE values and the small differences among them
demonstrate not only how effective the networks are in predicting the item difficulty but also
the consistency in the predictions from run to run. The averages of the MSE values obtained
for the networks with and without the sigmoid function (0.0129 and 0.0133 respectively) also
show that the performance of the network with the sigmoid function is marginally superior to
that of the network without the sigmoid function.
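The MSE figures can be checked with a few lines of Python. The sketch below recomputes the MSE for Run 1 of Table 4 from the tabulated values and then averages the eight per-run MSE values reported for each network in Tables 2-5; the function and variable names are illustrative.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted p-values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Table 4, Run 1 (with sigmoid, switched training and test sets)
actual = [0.76, 0.87, 0.90, 0.51, 0.70, 0.60, 0.67,
          0.56, 0.44, 0.43, 0.39, 0.16, 0.50, 0.59]
run1 = [0.773, 0.887, 0.862, 0.526, 0.686, 0.581, 0.716,
        0.637, 0.239, 0.292, 0.323, 0.267, 0.407, 0.689]
table4_run1 = mse(actual, run1)  # close to the reported 0.0074

# Per-run MSE values as reported in Tables 2 and 4 (with sigmoid)
with_sigmoid = [0.0165, 0.0197, 0.0134, 0.0172, 0.0074, 0.0096, 0.0102, 0.0091]
# Per-run MSE values as reported in Tables 3 and 5 (without sigmoid)
without_sigmoid = [0.0128, 0.0169, 0.0112, 0.0161, 0.0072, 0.0107, 0.0158, 0.0154]

avg_with = sum(with_sigmoid) / len(with_sigmoid)
avg_without = sum(without_sigmoid) / len(without_sigmoid)
print(round(avg_with, 4), round(avg_without, 4))  # 0.0129 0.0133
```

The recomputed Run 1 value agrees with the tabulated MSE to within rounding, and the two averages reproduce the figures quoted in the text.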

A correlation analysis and the Kruskal-Wallis test were also employed to assess how
accurately the neural network predicted the item difficulty values. The correlation
matrices corresponding to the runs in Tables 2-5 are shown in Tables 6-9 and all correlation
coefficients reported in the tables are significant at the 0.01 level for a one-tailed test.
The Kruskal-Wallis test, a nonparametric alternative to the one-way analysis of variance,
was utilized to determine whether there was a difference between the actual p-values and the
predicted p-values for different test runs. The Kruskal-Wallis test was selected because
neither normality of distribution nor homogeneity of variance for the groups of p-values
under study could be assumed. The Kruskal-Wallis test statistic is calculated from the sums
of ranks for the different samples of p-values, and the interpretation in this paper is that
of a hypothesis of equal means. For df 4 at the 0.01 level, the critical tabled value for
the chi-square statistic is 13.277. For Tables 2-5, the calculated statistic for each table
of values is smaller than the tabled value; therefore, it can be concluded that no
significant difference in the ranks has been established and, further, that the predicted
p-values are statistically indistinguishable from the actual p-values at this level.
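For readers who want to reproduce the test, the Kruskal-Wallis statistic can be computed from rank sums as sketched below. This is a generic textbook version with the usual correction for ties, not necessarily the exact procedure or software the authors used; the function name and the small example groups are illustrative.

```python
def kruskal_wallis(groups):
    """Return (H, df): the tie-corrected Kruskal-Wallis statistic and its df."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average rank
        avg_rank = (i + 1 + j) / 2.0
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = (12.0 / (n * (n + 1))
         * sum(r * r / len(group) for r, group in zip(rank_sums, groups))
         - 3.0 * (n + 1))
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:  # every value identical: no evidence of a difference
        return 0.0, len(groups) - 1
    return h / correction, len(groups) - 1

# Small made-up example with three groups
h, df = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, df)

# For the comparisons in this paper there are five groups (actual plus four
# runs), so df = 4 and the statistic is compared with 13.277 at the 0.01 level.
CRITICAL_DF4_001 = 13.277
```

A computed statistic below the critical value means the null hypothesis of no difference among the groups is not rejected.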

Table 4 Predicted p-values with the sigmoid function (switched training set and test set)

Actual     Predicted p-value
p-value    Run 1     Run 2     Run 3     Run 4
0.76       0.773     0.818     0.814     0.775
0.87       0.887     0.878     0.882     0.908
0.90       0.862     0.780     0.739     0.769
0.51       0.526     0.357     0.491     0.512
0.70       0.686     0.644     0.622     0.651
0.60       0.581     0.597     0.547     0.613
0.67       0.716     0.699     0.693     0.737
0.56       0.637     0.600     0.535     0.584
0.44       0.239     0.224     0.232     0.236
0.43       0.292     0.302     0.267     0.284
0.39       0.323     0.341     0.312     0.330
0.16       0.267     0.336     0.298     0.332
0.50       0.407     0.433     0.418     0.439
0.59       0.689     0.677     0.542     0.619
MSE        0.0074    0.0096    0.0102    0.0091

χ² = 5.234, n.s., 0.01 level, df 4



Table 5 Predicted p-values without the sigmoid function (switched training set and test set) 

Actual     Predicted p-value
p-value    Run 1     Run 2     Run 3     Run 4
0.76       0.829     0.852     0.864     0.805
0.87       0.861     0.863     0.843     0.936
0.90       0.846     0.729     0.628     0.662
0.51       0.553     0.569     0.486     0.539
0.70       0.687     0.666     0.625     [illegible]
0.60       0.605     0.592     0.501     0.576
0.67       0.724     0.713     0.696     0.717
0.56       0.606     0.652     0.553     0.488
0.44       0.238     0.222     0.221     0.200
0.43       0.301     0.323     0.286     0.254
0.39       0.332     0.351     0.333     0.299
0.16       0.267     0.341     0.294     0.315
0.50       0.394     0.439     0.425     0.441
0.59       0.534     0.613     0.433     0.464
MSE        0.0072    0.0107    0.0158    0.0154

χ² = 6.016, n.s., 0.01 level, df 4 
Table 6 Correlation matrix with the sigmoid function (original training set and test set) 

                              1        2        3        4        5
1  Actual p-value                      0.850    0.836    0.879    0.726
2  Predicted p-value, Run 1                     0.991    0.995    0.987
3  Predicted p-value, Run 2                              0.991    0.987
4  Predicted p-value, Run 3                                       0.991
5  Predicted p-value, Run 4

All correlations are significant for df 13, p<0.01 for a one-tailed test 



Table 7 Correlation matrix without the sigmoid function (original training set and test set) 

                              1        2        3        4        5
1  Actual p-value                      0.879    0.856    0.896    0.856
2  Predicted p-value, Run 1                     0.890    0.981    0.989
3  Predicted p-value, Run 2                              0.983    0.984
4  Predicted p-value, Run 3                                       0.975
5  Predicted p-value, Run 4

All correlations are significant for df 13, p<0.01 for a one-tailed test 



Table 8 Correlation matrix with the sigmoid function (switched training set and test set) 

                              1        2        3        4        5
1  Actual p-value                      0.919    0.877    0.898    0.888
2  Predicted p-value, Run 1                     0.987    0.969    0.982
3  Predicted p-value, Run 2                              0.984    0.991
4  Predicted p-value, Run 3                                       0.991
5  Predicted p-value, Run 4

All correlations are significant for df 12, p<0.01 for a one-tailed test 
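
The first entry of Table 8 can be checked against Table 4. The sketch below is not taken from the original study; it assumes the tabled coefficients are Pearson product-moment correlations, uses illustrative names, and takes 2.681 as the standard tabled critical value of t for df 12 at the 0.01 level, one-tailed. Under those assumptions, the correlation between the actual p-values and the Run 1 predictions comes out at about 0.919, in line with the table.

```python
import math

# Actual p-values and Run 1 predictions transcribed from Table 4.
actual = [0.76, 0.87, 0.90, 0.51, 0.70, 0.60, 0.67,
          0.56, 0.44, 0.43, 0.39, 0.16, 0.50, 0.59]
run_1 = [0.773, 0.887, 0.862, 0.526, 0.686, 0.581, 0.716,
         0.637, 0.239, 0.292, 0.323, 0.267, 0.407, 0.689]

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(actual, run_1)
df = len(actual) - 2                        # 14 items give df 12
t_stat = r * math.sqrt(df / (1.0 - r * r))  # t statistic for testing r = 0
T_CRITICAL = 2.681                          # tabled t, df 12, 0.01 level, one-tailed
print(f"r = {r:.3f}, t = {t_stat:.2f}, significant: {t_stat > T_CRITICAL}")
```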



Table 9 Correlation matrix without the sigmoid function (switched training set and test set) 

                              1        2        3        4        5
1  Actual p-value                      0.922    0.857    0.839    0.845
2  Predicted p-value, Run 1                     0.977    0.952    0.953
3  Predicted p-value, Run 2                              0.966    0.961
4  Predicted p-value, Run 3                                       0.976
5  Predicted p-value, Run 4

All correlations are significant for df 12, p<0.01 for a one-tailed test 
VI Discussion 

The ability of the neural network to predict the p-values is highly dependent on the 
training set. The training set must be large enough to be representative of the data in 
order to approximate the desired input/output mapping during training. In the experiments 
conducted, the training set was relatively small; nevertheless, the prediction of the item 
difficulties was quite reasonable. A significant improvement in the prediction of the 
p-values by the neural networks can, therefore, be expected when a larger training set 
becomes available. No prior assumptions about the functional mapping between the 24 
variables and the corresponding p-values, the significance of the variables, or the 
relationships between the variables were made. The functional mapping was approximated by 
the network during training with a part of the available data. 

The next phase of our research will involve the identification of variables or types 
of variables for which predictions are most sensitive. By dropping variables and comparing 
the magnitude of change in predicted item difficulty, it should be possible to drastically 
reduce the number of predictor variables from 24 to a more manageable number and still 
maintain close concurrence between the actual and predicted values. 
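
One way such a variable-dropping comparison might be organized is sketched below. Everything in it is an assumption made for illustration: "dropping" a variable is taken to mean setting its coded value to zero, the sensitivity measure is the mean absolute change in the prediction, and `toy_predict` is a hypothetical three-variable stand-in for a trained network rather than the 24-variable model used in this study.

```python
def sensitivity_by_dropping(predict, items):
    """Mean absolute change in prediction when each variable is zeroed in turn."""
    n_vars = len(items[0])
    baseline = [predict(x) for x in items]
    scores = []
    for j in range(n_vars):
        # Re-predict every item with variable j replaced by zero.
        dropped = [predict(x[:j] + [0.0] + x[j + 1:]) for x in items]
        total_change = sum(abs(b - d) for b, d in zip(baseline, dropped))
        scores.append(total_change / len(items))
    return scores

def toy_predict(x):
    """Hypothetical stand-in for a trained network: a fixed weighted sum."""
    weights = [0.5, 0.0, -0.2]
    return sum(w * v for w, v in zip(weights, x))

# Three hypothetical coded items with three variables each.
items = [[1.0, 1.0, 0.0],
         [0.0, 1.0, 1.0],
         [1.0, 0.0, 1.0]]

scores = sensitivity_by_dropping(toy_predict, items)
for index, score in enumerate(scores, start=1):
    print(f"variable {index}: mean absolute change = {score:.3f}")
```

In this toy case the second variable has no effect on the prediction and would be a candidate for removal, while the first has the largest effect.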

VII Conclusion 

This paper focused on developing a neural network approach to predict item difficulty in a 
standardized reading comprehension test. The results obtained from the two backpropagation 
neural networks designed for the prediction problem demonstrate that the networks can 
consistently predict item difficulty with a high degree of success. The results of 
training an ANN to predict item difficulty in a reading comprehension test have direct 
application to pretesting and can be applied to other items. The use of ANNs should 
further inform the testing community of what determines item difficulty and provide a basis 
for generalizing about how selected variables affect item difficulty. 